Validating Sequence Analysis Typologies Using Parametric Bootstrap

Post by Matthias Studer

clustervalid.knit

Introduction

This document provides a very quick introduction to the R code needed to use parametric bootstraps for typology validation in sequence analysis. Readers interested in the methods and the exact interpretation of the results are referred to:

Creating the State Sequence Object

For this example, we will use the mvad dataset. Let’s start with the creation of the state sequence object.

## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)

## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")

## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)

Creating the typology

We will now create a typology using cluster analysis. Readers interested in more detail are referred to the WeightedCluster library manual, which goes into the details of the creation and computation of cluster quality measures.

We start by computing dissimilarities with the seqdist function and the Hamming distance. We then use Ward clustering to create a typology of the trajectories. For this step, we recommend the use of the fastcluster library, which considerably speed up the computations.

## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="HAM")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")

We can now compute several cluster quality indices using as.clustrange function from two to ten groups.

# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
##            PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC
## cluster2  0.55 0.71 0.69 0.39 0.39 184.57 0.21 347.47 0.33 0.14
## cluster3  0.49 0.57 0.57 0.27 0.27 139.53 0.28 261.70 0.42 0.22
## cluster4  0.46 0.58 0.57 0.24 0.24 124.40 0.35 228.94 0.49 0.23
## cluster5  0.51 0.68 0.67 0.27 0.28 118.20 0.40 241.78 0.58 0.18
## cluster6  0.52 0.70 0.70 0.28 0.29 117.08 0.45 241.96 0.63 0.17
## cluster7  0.55 0.78 0.78 0.32 0.32 114.14 0.49 245.83 0.68 0.13
## cluster8  0.56 0.82 0.82 0.32 0.33 109.22 0.52 244.85 0.71 0.11
## cluster9  0.57 0.85 0.85 0.34 0.35 105.74 0.55 255.12 0.74 0.09
## cluster10 0.55 0.85 0.84 0.32 0.33 101.81 0.57 244.05 0.76 0.10

Parametric Bootstrap

Parametric bootstrap aims to provide baseline values obtained by clustering similar but non-clustered data. This can be computed using the seqnullcqi function with the following parameters:

  • R: number of bootstraps.
  • model: The null model (see table below
  • seqdist.args: list of arguments passed to seqdist (should be identical to first call to seqdist).
  • hclust.method: hierarchical clustering method (should be identical to orginal clustering).
  • kmedoid: If TRUE, use PAM (and the wcKMedRange function) instead of hierarchical clustering.
  • parallel: If TRUE, use parallel computing to speed up the computations.

Combined Randomization

The following R code estimate expected values of the cluster quality indices when clustering similar sequences that are not clustered according to the "combined" model, using the Hamming distance and Ward hierarchical clustering. We set parallel=TRUE to use parallel computing. You can use progressbar=TRUE to show a progress bar of the computations (not meaningful here within a document)

bcq.combined <- seqnullcqi(mvad.seq, clustqual, R=1000, model="combined", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)

Once the parametric bootstrap is computed (may take a while…), the results are stored in the bcq.combined object. Printing the object (just by writing its name), already provides several information, the standardized cluster quality indices and the inconclusive interval (standardized values).

bcq.combined
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined") 
## Number of bootstraps: 1000 
## Clustering method: hclust with ward.D 
## Seqdist arguments: list(method = "HAM") 
## 
## 
##                                   PBC           HG      HGSD           ASW
## cluster2                         6.15         6.06      6.15         15.14
## cluster3                         4.28         4.31       4.4          8.23
## cluster4                         2.28         2.65      2.72           5.3
## cluster5                         2.11          2.4      2.46          6.05
## cluster6                         1.33         1.67       1.7           5.5
## cluster7                         2.29         3.12      3.16          8.29
## cluster8                         3.09         4.12      4.15           9.8
## cluster9                         3.86         4.68      4.71         11.81
## cluster10                        3.24         4.23      4.26         11.69
##                                                                           
## Null Max-T 0.95 interval [-0.08; 2.3] [0.01; 2.15] [0; 2.16] [-0.22; 2.33]
##                                   ASWw            CH            R2         CHsq
## cluster2                         15.06         29.07         24.65        32.36
## cluster3                          8.18         24.52          19.9        24.89
## cluster4                          5.26         23.67         18.49        20.78
## cluster5                          6.01         24.59         18.46        22.79
## cluster6                          5.45         26.91         19.27        22.57
## cluster7                          8.23         30.77          21.2        26.43
## cluster8                          9.72         33.57          22.6        29.86
## cluster9                          11.7         36.88          24.2        35.71
## cluster10                        11.55         39.15          25.2        36.89
##                                                                                
## Null Max-T 0.95 interval [-0.23; 2.33] [-1.16; 2.57] [-1.16; 2.49] [-1.1; 2.61]
##                                  R2sq            HC
## cluster2                        24.15         -5.18
## cluster3                        17.56         -4.75
## cluster4                        14.23         -3.23
## cluster5                        14.22         -3.22
## cluster6                        13.39          -2.8
## cluster7                        14.71         -4.84
## cluster8                        15.91          -6.4
## cluster9                        17.68         -7.47
## cluster10                       18.11         -7.36
##                                                    
## Null Max-T 0.95 interval [-1.1; 2.49] [-0.15; 2.77]

Looking at the numbers, 2, 9 and 10 groups stand out. To get non-standardized values, use norm=FALSE. Notice that inconclusive ASW intervals are well below the recommended values (over 0.5).

print(bcq.combined, norm=FALSE)
## Parametric bootstrap cluster analysis validation
## Sequence analysis null model: list(model = "combined") 
## Number of bootstraps: 1000 
## Clustering method: hclust with ward.D 
## Seqdist arguments: list(method = "HAM") 
## 
## 
##                                   PBC           HG         HGSD         ASW
## cluster2                         0.55         0.71         0.69        0.39
## cluster3                         0.49         0.57         0.57        0.27
## cluster4                         0.46         0.58         0.57        0.24
## cluster5                         0.51         0.68         0.67        0.27
## cluster6                         0.52          0.7          0.7        0.28
## cluster7                         0.55         0.78         0.78        0.32
## cluster8                         0.56         0.82         0.82        0.32
## cluster9                         0.57         0.85         0.85        0.34
## cluster10                        0.55         0.85         0.84        0.32
##                                                                            
## Null Max-T 0.95 interval [0.45; 0.54] [0.69; 0.78] [0.68; 0.78] [0.14; 0.2]
##                                  ASWw             CH           R2          CHsq
## cluster2                         0.39         184.57         0.21        347.47
## cluster3                         0.27         139.53         0.28         261.7
## cluster4                         0.24          124.4         0.35        228.94
## cluster5                         0.28          118.2          0.4        241.78
## cluster6                         0.29         117.08         0.45        241.96
## cluster7                         0.32         114.14         0.49        245.83
## cluster8                         0.33         109.22         0.52        244.85
## cluster9                         0.35         105.74         0.55        255.12
## cluster10                        0.33         101.81         0.57        244.05
##                                                                                
## Null Max-T 0.95 interval [0.15; 0.21] [40.07; 56.73] [0.31; 0.34] [74.6; 99.79]
##                                  R2sq           HC
## cluster2                         0.33         0.14
## cluster3                         0.42         0.22
## cluster4                         0.49         0.23
## cluster5                         0.58         0.18
## cluster6                         0.63         0.17
## cluster7                         0.68         0.13
## cluster8                         0.71         0.11
## cluster9                         0.74         0.09
## cluster10                        0.76          0.1
##                                                   
## Null Max-T 0.95 interval [0.48; 0.53] [0.33; 0.44]

Several plots can then be used to inspect the results using the plot command and the type argument. First, one can look at the sequences generated by the null model by using type="seqdplot".

plot(bcq.combined, type="seqdplot")

The overall distribution of the CQI values can be plotted using type="density". In this case, one also needs to specify the CQI to be used. All tested number of groups are found to be significant. Any CQI computed by as.clustrange() can be used here. To show the density of the average silhouette width ("ASW"), one can use:

plot(bcq.combined, stat="ASW", type="density")

By using type="line", we plot the obtained and bootstrapped CQI values depending on the number of groups. Here again

plot(bcq.combined, stat="ASW", type="line")

Randomized Sequencing

To use another null model, one only needs to change the model argument of the seqnullcqi function. The randomized sequencing keep the duration attached to each state, but randomizes the ordering of the spells. It can be used to uncover sequencing structure of the data.

bcq.seq <- seqnullcqi(mvad.seq, clustqual, R=1000, model="sequencing", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)

We can then plot the results as before. Notice that solutions between 3 and 6 are below the critical line.

plot(bcq.seq, stat="ASW", type="line")

Randomized Duration

The randomized duration keeps the same ordering of the states, but randomizes the time spent in each spell. It can be used to uncover the duration-related structure of the data.

bcq.dur <- seqnullcqi(mvad.seq, clustqual, R=1000, model="duration", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)

We can then plot the results as before. The solutions 3 and 4 groups solutions are below the “significance line”. Otherwise, the ranking of the solutions is the same.

plot(bcq.dur, stat="ASW", type="line")

State Independence

The state independence null model generates sequence, position by position, independently of the previous state. This is a quite unrealistic assumption for longitudinal data.

bcq.stateindep <- seqnullcqi(mvad.seq, clustqual, R=1000, model="stateindep", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)

Bootstrapped CQI values are extremely low compared to our clustering, meaning that we have a strong longitudinal structure (not surprising!).

plot(bcq.stateindep, stat="ASW", type="line")

First-order Markov Null Model

The first-order Markov null model generates sequences using time-invariant transition rates. As a result, the generated sequences are often quite similar to the observed ones. This model can uncover structure stemming from time-dependent transition rates.

bcq.Markov <- seqnullcqi(mvad.seq, clustqual, R=1000, model="Markov", seqdist.args=list(method="HAM"), hclust.method="ward.D", parallel = TRUE)
plot(bcq.Markov, stat="ASW", type="line")

Choosing a Solution

The various null models lead to the same conclusions and ranking of the solutions. Solutions between 3 and 6 groups were not always above the critical lines (in the sequencing null model for instance), and can be avoided. We generally saw good clustering quality for a clustering in 9 groups. The solution is shown below.

seqdplot(mvad.seq, clustqual$clustering$cluster9, border=NA)

Leave a Reply